Abstract

We propose a compression-based version of the empirical entropy of a finite string over a finite 
alphabet. Whereas previously one considers the naked entropy of (possibly higher order) Markov 
processes, we consider the sum of the description of the random variable involved plus the entropy 
it induces. We assume only that the distribution involved is computable. To test the new notion we 
compare the Normalized Information Distance (the similarity metric) with a related measure based on 
Mutual Information in Shannon's framework. This way the similarities and differences of the last two 
concepts are exposed. 

Index Terms: Empirical entropy, Kolmogorov complexity, normalized information distance,
similarity metric, mutual information distance

I. Introduction 

In the basic set-up of Shannon [20] a message is a finite string over a finite alphabet. One is interested 
in the expected number of bits to transmit a message from a sender to a receiver, when both the sender 
and the receiver consider the same ensemble of messages (the set of possible messages provided with a 
probability for each message). The expected number of bits is known as the entropy of the ensemble of 
messages. This ensemble is also known as the source. 

The empirical entropy of a single message is taken to be the entropy of a source that produced it as a 
typical element. (The notion of "typicality" is defined differently by different authors and we take here the 
intuitive meaning.) Traditionally, this source is a (possibly higher order) Markov process. This leads to 
the definition in Example 2.4. Here we want to liberate the notion so that it encompasses all computable
random variables with finitely many outcomes consisting of finite strings over a finite alphabet. Moreover,
since we are given only a single message, but not the ensemble of which it is an element, the new
empirical entropy should provide both this ensemble and the entropy it induces. If we are given just
the entropy but not the ensemble involved, then a receiver cannot in general reconstruct the message.
Moreover, we are given a single message which has a particular length, say $n$. Therefore, given the
family of random variables we draw upon, we can select one of them and compute the probability of
every message of length $n$. For fixed $n$, this results in a Bernoulli variable that has $|\Sigma|^n$ outcomes.

We are thus led to a notion of empirical entropy that consists of a description of the Bernoulli variable 
involved plus the related entropy of the message induced. Since we assume the original probability mass 
function to be computable, the Bernoulli variable is computable and its effective description length can 
be expressed by its Kolmogorov complexity. 

Normalized Information Distance (explained below) between two finite objects is often confused with 
a similar distance between two random variables. The last distance is expressed in terms of probabilistic 
mutual information. We use our new notion to explain the differences between the former distance between 
two individual objects and the latter distance between two random variables. This difference parallels 
that between the Kolmogorov complexity of a single finite object and the entropy of a random variable. 
The former quantifies the information in a finite object, while the latter gives us the expected number 
of bits to communicate any outcome of a random variable known to both the sender and the receiver. 
Computability notions are reviewed in Appendix A and Kolmogorov complexity in Appendix B.

A. Preliminaries

We write string to mean a finite string over a finite alphabet $\Sigma$. Other finite objects can be encoded
into strings in natural ways. The set of strings is denoted by $\Sigma^*$. We usually take $\Sigma = \{0, 1\}$. The length
of a string $x$ is the number of letters of $\Sigma$ in it, denoted by $|x|$. The empty string $\epsilon$ has length $|\epsilon| = 0$.
Identify the natural numbers $\mathcal{N}$ (including 0) and $\{0, 1\}^*$ according to the correspondence

$$(0, \epsilon), (1, 0), (2, 1), (3, 00), (4, 01), \ldots \qquad \text{(I.1)}$$

Then, $|010| = 3$. The emphasis here is on binary sequences only for convenience; observations in every
finite alphabet can be so encoded in a way that is 'theory neutral.' For example, if a finite alphabet $\Sigma$
has cardinality $2^k$, then every element $i \in \Sigma$ can be encoded by $a(i)$, which is a block of bits of length
$k$. With this encoding every $x \in \Sigma^*$ satisfies $K(x) = K(a(x))$ for the Kolmogorov complexity (see
Appendix B for basic definitions and results on Kolmogorov complexity) up to an additive constant that
is independent of $x$.
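
The correspondence between natural numbers and binary strings can be made concrete with a short sketch. The function names `nat_to_str` and `str_to_nat` are hypothetical and chosen only for illustration; the construction assumes the usual reading of the correspondence, namely that the string for $n$ is the binary expansion of $n + 1$ with its leading 1 removed.

```python
def nat_to_str(n: int) -> str:
    """Map a natural number to its binary string under the correspondence
    (0, empty), (1, '0'), (2, '1'), (3, '00'), (4, '01'), ...

    Assumption: the string is the binary expansion of n + 1 without its
    leading '1'. bin() returns e.g. '0b101', so we drop the first 3 chars.
    """
    if n < 0:
        raise ValueError("n must be a natural number")
    return bin(n + 1)[3:]


def str_to_nat(s: str) -> int:
    """Inverse map: prepend '1', read as binary, subtract 1."""
    if any(c not in "01" for c in s):
        raise ValueError("s must be a binary string")
    return int("1" + s, 2) - 1


if __name__ == "__main__":
    for n in range(5):
        print(n, repr(nat_to_str(n)))
```

Under this reading the string 010 corresponds to the number 9, and its length is 3 as stated in the text.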



II. The New Empirical Entropy 
Let $X$ be a random variable with outcomes in a finite alphabet $\mathcal{X}$. Shannon's entropy [20] is

$$H(X) = \sum_{x \in \mathcal{X}} P(X = x) \log 1/P(X = x).$$
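
As a small illustration of the entropy formula, the following sketch computes $H(X)$ in bits for a probability mass function given as a dictionary. The name `shannon_entropy` and the dictionary representation are assumptions made for this example only.

```python
from math import log2


def shannon_entropy(pmf: dict) -> float:
    """Entropy in bits of a pmf given as {outcome: probability}.

    Outcomes with probability 0 contribute nothing, following the usual
    convention 0 * log(1/0) = 0.
    """
    total = sum(pmf.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * log2(1.0 / p) for p in pmf.values() if p > 0)


if __name__ == "__main__":
    print(shannon_entropy({"0": 0.5, "1": 0.5}))   # fair coin
    print(shannon_entropy({"a": 1.0}))             # deterministic outcome
```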

There are three items involved in the new empirical entropy of data x: 

• A class of random variables like the set of Bernoulli processes, or the set of higher order Markov
processes; from each element of this class we construct a Bernoulli variable $X$ with $|\Sigma|^n$ outcomes
of length $n$;

• a selection of a random variable from this Bernoulli class such that x is a typical outcome, and 

• a description of this random variable plus its entropy. 

This is reminiscent of universal coding essentially due to Kolmogorov [11], and of two-part MDL due
to Rissanen [19]. In its simplest form the former, assuming a Bernoulli process, codes a string $x$ of
length $n$ over a finite alphabet $\Sigma$ as follows: a string containing a description of $n$, $|\Sigma|$ and the counts $n_i$
($1 \leq i \leq |\Sigma|$) of the occurrences of each character, followed by the index of $x$ in the set constrained by these items.
The coding should be such that
the individual substrings can be parsed, except the description of the index which we put last. This takes
additive terms that are logarithmic in the length of the items except the last one. The universal code
takes $O(|\Sigma| \log n) + \log \binom{n}{n_1, \ldots, n_{|\Sigma|}}$ bits. The two-part MDL complexity of a string [19] is the minimum,
over sources, of the sum of the self-information of that string with respect to a source and the number of bits needed to represent
that source. The source is not required to be Markovian and the two-part MDL takes into account its
complexity. However, the methods of encoding are arbitrary.
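
The dominant term of the universal code, the number of bits needed to index $x$ among all strings with the same character counts, can be evaluated directly. The sketch below assumes the index is taken in the set of all strings of length $n$ with the given counts, so that its length is the logarithm of a multinomial coefficient; the function name `index_bits` is hypothetical.

```python
from collections import Counter
from math import comb, log2


def index_bits(x: str) -> float:
    """log2 of the number of strings having the same character counts as x.

    The multinomial coefficient n! / (n_1! ... n_k!) is built up as a
    product of binomial coefficients to stay in exact integer arithmetic.
    """
    remaining = len(x)
    ways = 1
    for count in Counter(x).values():
        ways *= comb(remaining, count)
        remaining -= count
    return log2(ways)


if __name__ == "__main__":
    # A constant string needs no index; a balanced one needs many bits.
    print(index_bits("0" * 16))
    print(index_bits("01" * 8))
```

The logarithmic overhead for describing $n$, $|\Sigma|$ and the counts is not modelled here, since it depends on the self-delimiting code chosen.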

An $n$-length outcome $x = x_1, x_2, \ldots, x_n$ over $\Sigma$ is the outcome of a stochastic process $X_1, X_2, \ldots, X_n$
characterized by a joint probability mass function $\Pr\{(X_1, X_2, \ldots, X_n) = (x_1, x_2, \ldots, x_n)\}$. For
technical reasons we replace the list $X_1, X_2, \ldots, X_n$ by a single Bernoulli random variable $X$ with
outcomes in $\mathcal{X} = \Sigma^n$. Here, the random variables $X_i$ may be independent copies of a single random
variable, as is the case when the source stochastic process is a Bernoulli variable. But the source stochastic
process may be a higher order Markov chain making some or all $X_i$'s dependent (this depends on whether
the order of the Markov chain is greater than $n$). For certain stochastic processes all $X_i$'s are dependent
for every $n$: the stochastic process assigns a probability to every outcome in $\Sigma^*$.

Definition 2.1: Let $n$ be an integer, $\Sigma$ a finite alphabet, $x \in \Sigma^n$ be a string, $\mathcal{X}$ a family of computable
processes, each process $\Xi \in \mathcal{X}$ producing (possibly by repetition) a sequence of (possibly dependent)
random variables $X = X_1, X_2, \ldots, X_n$, such that $\Pr(X = x)$ is computable and $H(X) < \infty$. The empirical
entropy of $x$ with respect to $\mathcal{X}$ is given by

$$H(x|\mathcal{X}) = \min\{K(X) + H(X) : |H(X) - \log 1/\Pr(X = x)| \text{ is minimal}\}.$$

This means that the expected binary length of encoding an outcome of $X$ is as close as possible to
$\log 1/\Pr(X = x)$. In the two-part description the complexity part describes $X$, and the entropy part is
the ignorance about the data $x$ in the set $\Sigma^n$ given $X$.

Remark 2.2: By assumption $n$ is fixed. By Theorem 3 in [20], i.e. the asymptotic equipartition
property, for ergodic Markov sources the following is the case. Let $H$ be the per symbol entropy of the
source. For example, if the source $\Xi$ is Bernoulli with $\Pr(\Xi = s_i) = p(s_i)$ ($s_i \in \Sigma$ for $1 \leq i \leq |\Sigma|$), then
$H = \sum_{i=1}^{|\Sigma|} p(s_i) \log 1/p(s_i)$. Let $X$ be the induced Bernoulli variable with $|\Sigma|^n$ outcomes consisting of
sequences of length $n$ over $\Sigma$. Then, for every $\epsilon, \delta > 0$ there is an $n_0$ such that the sequences of length
$n \geq n_0$ are divided into two classes: one set with total probability less than $\epsilon$ and one set such that
every $y$ in this set satisfies $|H - \frac{1}{n} \log 1/\Pr(X = y)| < \delta$. Note that $H(X) = nH$. Thus, for large enough
$n$ we are almost certain to have $|H(X) - \log 1/\Pr(X = x)| = o(n)$.

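Set $\epsilon = \delta$ for convenience. We call the set of $y$'s such that $|H(X) - \log 1/\Pr(X = y)| \leq \epsilon n$, with
$\epsilon > 0$ and some $n_0$ depending on $\epsilon$ and $n \geq n_0$, the $\epsilon$-typical outcomes of $X$. The cardinality of the set
$S \subseteq \Sigma^n$ of such $y$'s satisfies

$$(1 - \epsilon)\,|\Sigma|^{H(X) - \epsilon n} \leq |S| \leq |\Sigma|^{H(X) + \epsilon n}.$$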

See [3], Theorem 3.1.2.

Lemma 2.3: Assume Definition 2.1. Then, $K(X) \leq K(x, \mathcal{X}) + O(1)$.

Proof: The family $\mathcal{X}$ consists of computable random variables, that is, in essence of computable
probability mass functions. The family of all lower semicomputable semiprobability mass functions can
be effectively enumerated, possibly with repetitions (Theorem 4.3.1 in [17]). The latter family contains
all computable probability mass functions, hence it contains $\mathcal{X}$. Thus, if we know $x$ and $\mathcal{X}$ we can compute
the $X \in \mathcal{X}$ of Definition 2.1 by going through this list. ■

Example 2.4: Assume Definition 2.1. Let $n_i$ be the number of occurrences of the $i$th character of $\Sigma$ in
$x$. If $w$ is a string then $x_w$ is the string obtained by concatenating the characters immediately following
occurrences of $w$ in $x$. The cardinality $|x_w|$ is the number of occurrences of $w$ in $x$ unless $w$ occurs as
a suffix of $x$, in which case it is 1 less. In [12], [18], [8] the $k$th order empirical entropy of $x$ is defined as

$$H_k(x) = \begin{cases} \frac{1}{n} \sum_{i=1}^{|\Sigma|} n_i \log \frac{n}{n_i} & \text{if } k = 0, \\ \frac{1}{n} \sum_{|w| = k} |x_w| H_0(x_w) & \text{if } k > 0. \end{cases} \qquad \text{(II.1)}$$

The $k$th order empirical entropy of $x$ can be reconstructed from $x$ once we know $k$. The $k$th order
empirical entropy of $x$ results from the probability induced by a $k$th order Markov source $\Xi \in \mathcal{X}$. (A
Bernoulli process is a 0th order Markov source.)

Let $\mathcal{X}$ be the family of $k$th order Markov sources (a specific $k \geq 0$), provided the transition
probabilities are computable. Such a family is subsumed under Definition 2.1. Let $x$ be a string over
$\Sigma$ which is typically produced by such a Markov source of order $k$. The empirical entropy $H(x|\mathcal{X})$ of
$x$ is $K(X) + nH_k(x)$. Here $X$ is the random variable associated with the $k$th order empirical entropy
computed from $x$. Note that the empirical entropy $H_k(x)$ stops being a reasonable complexity metric for
almost all strings roughly when $|\Sigma|^k$ surpasses $n$ [8].
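
The $k$th order empirical entropy of Example 2.4 can be computed directly from $x$. The following is a minimal sketch under the standard definition from the compressed-indexing literature (counts of characters for $k = 0$, and a weighted sum of the zeroth order entropies of the follower strings $x_w$ for $k > 0$); the names `h0` and `hk` are hypothetical.

```python
from collections import Counter
from math import log2


def h0(x: str) -> float:
    """Zeroth order empirical entropy of x in bits per character."""
    n = len(x)
    if n == 0:
        return 0.0
    return sum(c / n * log2(n / c) for c in Counter(x).values())


def hk(x: str, k: int) -> float:
    """k-th order empirical entropy of x in bits per character.

    For every context w of length k, collect the characters that
    immediately follow an occurrence of w in x (the string x_w), then
    average the zeroth order entropies weighted by |x_w|.
    """
    n = len(x)
    if n == 0:
        return 0.0
    if k == 0:
        return h0(x)
    followers = {}
    for i in range(n - k):  # an occurrence of w as a suffix has no follower
        followers.setdefault(x[i:i + k], []).append(x[i + k])
    return sum(len(f) * h0("".join(f)) for f in followers.values()) / n


if __name__ == "__main__":
    x = "10" * 8
    print(h0(x), hk(x, 1))
```

For the alternating string the zeroth order value is 1 bit per character while the first order value is 0, which mirrors the contrast drawn in Example 2.5.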

Example 2.5: Let $x = (10)^{n/2}$ for even $n$ (that is, $n/2$ copies of the pattern "10"). Let $\mathcal{X}_1$ be the
family of binary Bernoulli processes. The empirical entropy $H(x|\mathcal{X}_1)$ is reached for the i.i.d. sequence
$X = X_1, X_2, \ldots, X_n \in \mathcal{X}_1$, each $X_i$ being a copy of the same random variable $Y$ with outcomes in
$\{0, 1\}$ with $P(Y = 1) = \frac{1}{2}$. Then, $H(x|\mathcal{X}_1) = K(X) + nH(Y)$. Here $X$ can be computed from the
information concerning $n$ in $O(\log n)$ bits, the particular $\Xi \in \mathcal{X}_1$ used in $O(1)$ bits, and a program of
$O(1)$ bits to compute $X$ from this information. In this way $K(X) = O(\log n)$. Moreover, $H(Y) = 1$, so
that $H(x|\mathcal{X}_1) = n + O(\log n)$.

Let $\mathcal{X}_2$ be the family of first order Markov processes with 2 transitions each and with output alphabet
$\{0, 1\}$ for each state. The empirical entropy $H(x|\mathcal{X}_2)$ is reached for the $n$-bit output of a deterministic
"parity" Markov process. That is, $X = X_1, X_2, \ldots, X_n$ and every $X_i$ gives the output at time $i$ of the
Markov process with 2 states $s_0$ and $s_1$ defined as follows. The transition probabilities are $p(s_0 \to s_1) = 1$
and $p(s_1 \to s_0) = 1$, while the output in state $s_0$ is 0 and in state $s_1$ is 1. The start state is $s_0$. In this way,
$P(X = (10)^{n/2}) = 1$ while $H(X) = 0$. Then, $H(x|\mathcal{X}_2) = K(X) + H(X)$. Here $K(X) = O(\log n)$,
since we require a description of $n$, the 2-state Markov process involved, and a program to compute $X$
from this information. Since the outcome is deterministic, $H(X) = 0$, so that $H(x|\mathcal{X}_2) = O(\log n)$.
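
The two models of Example 2.5 can be compared with a short simulation. The sketch assumes that the parity process emits the output of the state it enters at each time step, which is the reading under which the start state $s_0$ yields the string beginning with 1; the function names are hypothetical.

```python
from math import log2


def parity_process_output(n: int) -> str:
    """Output of the deterministic two-state process after n steps.

    Assumption: at each time step the process moves to the other state
    (probability 1) and emits that state's output, 0 for s0 and 1 for s1.
    """
    state = 0  # start in s0
    out = []
    for _ in range(n):
        state = 1 - state
        out.append(str(state))
    return "".join(out)


def bernoulli_entropy_bits(n: int, p: float = 0.5) -> float:
    """Entropy n * H(Y) of n i.i.d. copies of a binary Y with P(Y = 1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return n * -(p * log2(p) + (1 - p) * log2(1 - p))


if __name__ == "__main__":
    n = 16
    print(parity_process_output(n))       # the string x of the example
    print(bernoulli_entropy_bits(n))      # entropy part under the Bernoulli model
    # Under the parity model the outcome has probability 1, so the
    # entropy part is 0 and only the O(log n) description of X remains.
```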

Example 2.6: Consider the first $n$ bits of $\pi = 3.1415\ldots$. Let $\mathcal{X}_1$ be the family of Bernoulli processes.
Empirically, it has been established that the frequency of 1's in the binary expansion of $\pi$ is $n/2 \pm O(\sqrt{n})$,
that is, the binary expansion of $\pi$ is a typical pseudorandom sequence. Hence, $H(x|\mathcal{X}_1) = K(X) +
nH(Y)$ where $X = X_1, X_2, \ldots, X_n \in \mathcal{X}_1$ and the $X_i$'s are $n$ i.i.d. copies of $Y$. Here
$Y$ is a Bernoulli process with $P(Y = 1) = \frac{1}{2}$. Then $K(X) = O(\log n)$ and $H(Y) = 1$, so that

$$H(x|\mathcal{X}_1) = n + O(\log n).$$

Let $\mathcal{X}_2$ be the family of computable random variables whose outcomes are binary strings of length
$n$. We know that there is a small program, say of about 10,000 bits, incorporating an approximation
algorithm that generates the successive bits of $\pi$ forever. Telling it to stop after $n$ bits, we can generate the
computable Bernoulli variable $X \in \mathcal{X}_2$ assigning probability 1 to $x$ and probability 0 to any other binary
string of length $n$. Assume $n = 1,000,000,000$. Then, we have $K(X) \leq \log 1,000,000,000 + c \approx 30 + c$
where the additive term $c$ is the number of bits of the program to compute $\pi$ and a program required
to turn the logarithmic description of 1,000,000,000 and the program to compute $\pi$ into the random
variable $X$. Finally, $H(X) = 0$. Therefore, $H(x|\mathcal{X}_2) \leq 10,030 + c$.

Example 2.7: Consider printed English, say just lower case and space signs, ignoring the other signs.
The entropy of representative examples of printed English has been estimated experimentally by Shannon
[21] based on human subjects' guesses of successive characters in a text. His estimate is between 0.6
and 1.3 bits per character (bpc), and [22] obtained an estimate of 1.46 bpc for PPM based models,
which we will use in this example. PPM (prediction by partial matching) is an adaptive statistical data
compression technique. It is based on context modeling and prediction and uses a set of previous symbols
in the uncompressed symbol stream to predict the next symbol in the stream, rather like a mechanical
version of Shannon's method. Consider a text of $n$ characters over the alphabet used by [22], and let
$\mathcal{X}$ be the class of PPM based models with $n$ output characters over the used alphabet. Since the PPM
machine can be described in $O(1)$ bits (its program is finite) and the length $n$ in $O(\log n)$ bits, we have
$K(X) = O(\log n)$. Hence, $H(x|\mathcal{X}) \leq K(X) + 1.46n = 1.46n + O(\log n)$.

In these examples we see that the empirical entropy is higher when the family of random variables 
considered is simpler. For simple random variables the knowledge in the Kolmogorov complexity part is 
negligible. The empirical entropy with respect to a complex family of random variables can be lower than
that with respect to a family of simple random variables by transforming the ignorance in the entropy 
part into knowledge in the Kolmogorov complexity part. We use this observation to consider the widest 
family of computable probability mass functions. 

Lemma 2.8: Let $\mathcal{X}$ be the family of computable random variables $X$ with $H(X) < \infty$, and $x \in \Sigma^*$
with $|\Sigma| < \infty$. Then, $H(x|\mathcal{X}) = K(x) + O(1)$.

Proof: First, let $p_x$ be a shortest prefix program which computes $x$. Hence $|p_x| = K(x)$. By adding
$O(1)$ bits to it we have a program $p_p$ which computes a probability mass function $p$ with $p(x) = 1$ and
$p(y) = 0$ for $y \neq x$ ($x, y \in \Sigma^*$). Hence $|p_p| \leq K(x) + O(1)$.

Second, let $q_p$ be a shortest prefix program which computes a probability mass function $p$ with $p(x) = 1$
and $p(y) = 0$ for $y \neq x$ ($x, y \in \Sigma^*$). Thus, $|q_p| \leq |p_p|$. Adding $O(1)$ bits to $q_p$ we have a program $q_x$
which computes $x$. Then, $K(x) \leq |q_p| + O(1)$.

Altogether, $|q_p| = K(x) + O(1)$. ■

For the sequel of this paper, we need to extend the notion of empirical entropy to joint probability
mass functions.

Definition 2.9: Let $n$ be an integer, $\Sigma$ a finite alphabet, $x, y \in \Sigma^n$ be strings, $\mathcal{Z}$ be the family of
computable joint probability mass functions, $Z \in \mathcal{Z}$ and $(x, y)$ an outcome of $Z$. Let the probability
mass function $p(x, y) = P(Z = (x, y))$ have a finite joint entropy $H(Z) < \infty$. The empirical entropy
of $(x, y)$ with respect to $\mathcal{Z}$ is

$$H(x, y|\mathcal{Z}) = \min\{K(Z) + H(Z) : |H(Z) - \log 1/p(x, y)| \text{ is minimal}\}.$$

Lemma 2.10: Let $\mathcal{Z}$ be the family of computable joint probability mass functions $Z$ with $H(Z) < \infty$,
and $x, y \in \Sigma^*$ with $|\Sigma| < \infty$. Then, $H(x, y|\mathcal{Z}) = K(x, y) + O(1)$.

Proof: Similar to that of Lemma 2.8. ■

III. Normalized Information Distance 

The classical notion of Kolmogorov complexity [11] is an objective measure for the information in a 
single object, and information distance measures the information between a pair of objects [2]. This last 
notion has spawned research in the theoretical direction, see the many Google Scholar citations to the 
above reference. Research in the practical direction has focused on the normalized information distance 
(NID), also called "the similarity metric," which arises by normalizing the information distance in a 
proper manner. (The NID is defined by (III.2) below.)

If we approximate the Kolmogorov complexity through real-world compressors [16], [6], [4], then we
obtain the normalized compression distance (NCD) from the NID. This is a parameter-free, feature-free,
and alignment-free similarity measure that has had great impact in applications. (Only the compressor
used can be viewed as a parameter or feature.) The NCD was preceded by a related nonoptimal distance
[13]. In [10] another variant of the NCD has been tested on all major time-sequence databases used in
all major data-mining conferences against all other major methods used. The compression method turned
out to be competitive in general and superior in heterogeneous data clustering and anomaly detection.

There have been many applications in pattern recognition, phylogeny, clustering, and classification,
ranging from hurricane forecasting and music to genomics and analysis of network traffic, see the
many papers referencing [16], [6], [4] in Google Scholar. In [16] it is shown that the NID, and in [4] that
the NCD subject to mild conditions on the used compressor, are metrics up to negligible discrepancies
in the metric (in)equalities and that they are always between 0 and 1. The computability status of the
NID has been resolved in [23]. The NCD is computable by definition.

The information distance $D(x, y)$ between strings $x$ and $y$ is defined as

$$D(x, y) = \min_p \{|p| : U(p, x) = y \wedge U(p, y) = x\},$$

where $U$ is the reference universal Turing machine (see Appendix B). Like the Kolmogorov complexity $K$, the
distance function $D$ is upper semicomputable. Define

$$E(x, y) = \max\{K(x|y), K(y|x)\}.$$

In [2] it is shown that the function $E$ is upper semicomputable, $D(x, y) = E(x, y) + O(\log E(x, y))$, the
function $E$ is a metric (more precisely, that it satisfies the metric (in)equalities up to a constant), and that
$E$ is minimal (up to a constant) among all upper semicomputable distance functions $D'$ satisfying the
mild normalization conditions $\sum_{y : y \neq x} 2^{-D'(x, y)} \leq 1$ and $\sum_{x : x \neq y} 2^{-D'(x, y)} \leq 1$. (Here and elsewhere in
this paper "log" denotes the binary logarithm.) The normalized information distance (NID) $e$ is defined
by

$$e(x, y) = \frac{E(x, y)}{\max\{K(x), K(y)\}}. \qquad \text{(III.1)}$$

It is straightforward that $0 \leq e(x, y) \leq 1$ up to some minor discrepancies for all $x, y \in \{0, 1\}^*$. Rewriting
$e$ using (A.1) yields

$$e(x, y) = \frac{K(x, y) - \min\{K(x), K(y)\}}{\max\{K(x), K(y)\}}, \qquad \text{(III.2)}$$

up to some lower order terms that we ignore.

Lemma 3.1: Let $x, y$ be strings, and let $\mathcal{X}, \mathcal{Z}$ be the families of random variables with computable probability
mass functions and computable joint probability mass functions, respectively. Moreover, for $X \in \mathcal{X}$ and
$Z \in \mathcal{Z}$ we have $H(X), H(Z) < \infty$. Then, we can substitute the Kolmogorov complexities in (III.2) by
the corresponding empirical entropies as in (III.3).

Proof: By Lemmas 2.8 and 2.10 we know the following. For $\mathcal{X}$ the family of computable
probability mass functions, $H(x|\mathcal{X}) = K(x)$ and $H(y|\mathcal{X}) = K(y)$. For $\mathcal{Z}$ the family of computable joint
probability mass functions, $H(x, y|\mathcal{Z}) = K(x, y)$. Hence,

$$e(x, y) = \frac{H(x, y|\mathcal{Z}) - \min\{H(x|\mathcal{X}), H(y|\mathcal{X})\}}{\max\{H(x|\mathcal{X}), H(y|\mathcal{X})\}}, \qquad \text{(III.3)}$$

ignoring lower order terms. ■
Remark 3.2: In Lemma 3.1 we can replace the computable random variables by the restriction to
computable random variables that have a singleton support, that is, probability mass functions $p$ with
$p(x) = 1$ for some $x$ and $p(y) = 0$ for all $y \neq x$. Alternatively, we can replace it by the family of
computable Markov processes. To see this, for every $x$ of length $n$ there is a computable Markov process
$M$ of order $n - 1$ that outputs $x$ deterministically and $K(x) = K(M) + O(1)$.

Clearly, if we replace the family of computable probability mass functions in the empirical entropies in
Lemma 3.1 by weaker subfamilies like the families based on computable Bernoulli functions, computable
Gaussians, or computable first order Markov processes, then Lemma 3.1 will not hold in general.
Remark 3.3: The NCD is defined by

$$NCD_Z(x, y) = \frac{|Z(xy)| - \min\{|Z(x)|, |Z(y)|\}}{\max\{|Z(x)|, |Z(y)|\}}, \qquad \text{(III.4)}$$

where $Z(x)$ is the compressed version of $x$ when it is compressed by a lossless compressor $Z$. We
have substituted $xy$ for the pair $(x, y)$ both for convenience and with ignorable consequences. Consider
a simple compressor that uses only Bernoulli variables, for example a Huffman code compressor. The
compressed version of a string is preceded by a header containing information identifying the compressor
and the characteristics used (the relative frequencies in this case) to compress the source string. In general
this is the case with every compressor. (In [3] the NCD based on compressors computing the static
Huffman code of a Bernoulli variable is shown to be the total Kullback-Leibler divergence to the mean.
We refrain from explaining these terms since they are extraneous to our treatment.)

Thus, $Z(x)$ includes the header generated by $Z$ for $x$. This header makes it possible to use
the uncompress feature, denoted here by $Z^{-1}$, so that $Z^{-1}(Z(x)) = x$. The header describes a random
variable $\Xi$ based on the compressor $Z$. The family of random variables induced by the compressor $Z$
can be denoted by $\mathcal{X}_Z$.

In this way, we can define the Bernoulli variable $X$ used to compress $x$. The empirical entropy is
$H(x|\mathcal{X}_Z) = K(X) + H(X)$. Here $K(X)$ is uncomputable. We approximate it by the length of the header,
say $|a(X)|$. The Bernoulli variable $X$ has entropy $H(X)$ and $|Z(x)| = |a(X)| + H(X)$. Similarly for
$y$ and $(x, y)$. Therefore,

$$NCD_Z(x, y) = \frac{|a(XY)| + H(X, Y) - \min\{|a(X)| + H(X), |a(Y)| + H(Y)\}}{\max\{|a(X)| + H(X), |a(Y)| + H(Y)\}}, \qquad \text{(III.5)}$$

ignoring lower order terms, where $|a(X)| \geq K(X)$, $|a(Y)| \geq K(Y)$, and $|a(XY)| \geq K(XY)$.
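
For a concrete feel of (III.4), the NCD can be evaluated with any off-the-shelf lossless compressor. The sketch below uses `zlib` from the Python standard library purely as an example of a real-world compressor $Z$; it is not the compressor of any of the cited works, and the function names `clen` and `ncd` are hypothetical.

```python
import random
import zlib


def clen(data: bytes) -> int:
    """Length in bytes of the zlib-compressed version of data, i.e. |Z(data)|."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance with the concatenation xy
    substituted for the pair (x, y)."""
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    text = b"the quick brown fox jumps over the lazy dog " * 50
    rng = random.Random(0)
    noise = bytes(rng.getrandbits(8) for _ in range(2000))
    print("NCD(text, text)  =", ncd(text, text))
    print("NCD(text, noise) =", ncd(text, noise))
```

A string compared with itself gives a value near 0, while a string compared with unrelated incompressible data gives a value near 1. The small deviations from these endpoints are the discrepancies a real compressor introduces through headers and finite windows.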



IV. Mutual Information 

In [25], [1], [13], [9], [26], [14] the entropy and joint entropy of a pair of sequences is determined,
and this is directly equated with the Kolmogorov complexity of those sequences. The Shannon type 
probabilistic version of (III.2) is

$$\begin{aligned} e_H(X, Y) &= \frac{H(X, Y) - \min\{H(X), H(Y)\}}{\max\{H(X), H(Y)\}} \\ &= 1 - \frac{\max\{H(X), H(Y)\} - H(X, Y) + \min\{H(X), H(Y)\}}{\max\{H(X), H(Y)\}} \\ &= 1 - \frac{I(X; Y)}{\max\{H(X), H(Y)\}}, \end{aligned}$$

since the mutual information $I(X; Y)$ between random variables $X$ and $Y$ is

$$I(X; Y) = H(X) + H(Y) - H(X, Y),$$

and

$$\max\{H(X), H(Y)\} + \min\{H(X), H(Y)\} = H(X) + H(Y).$$

In this way, $e_H(X, Y)$ is 1 minus the mutual information between random variables $X$ and $Y$ per bit of
the maximal entropy. How do the cited references connect this distance between two random variables
to (III.2), the distance between two individual outcomes $x$ and $y$?
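
The probabilistic distance $e_H$ is a property of a joint distribution, not of individual outcomes, which the following sketch makes explicit. It assumes the joint probability mass function is supplied as a dictionary keyed by outcome pairs; the names `entropy` and `e_H` are hypothetical.

```python
from math import log2


def entropy(probs) -> float:
    """Shannon entropy in bits of an iterable of probabilities."""
    return sum(p * log2(1.0 / p) for p in probs if p > 0)


def e_H(joint: dict) -> float:
    """Entropy-based distance (H(X,Y) - min{H(X),H(Y)}) / max{H(X),H(Y)}.

    joint maps pairs (x, y) to probabilities. The value is undefined when
    both marginals are deterministic, since the denominator is then 0.
    """
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    hx, hy, hxy = entropy(px.values()), entropy(py.values()), entropy(joint.values())
    if max(hx, hy) == 0:
        raise ValueError("both marginals are deterministic")
    return (hxy - min(hx, hy)) / max(hx, hy)


if __name__ == "__main__":
    independent = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
    identical = {(0, 0): 0.5, (1, 1): 0.5}
    print(e_H(independent))  # no mutual information
    print(e_H(identical))    # Y is determined by X
```

Two independent fair bits are at distance 1 and two identical fair bits at distance 0, in agreement with the form $1 - I(X; Y)/\max\{H(X), H(Y)\}$.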

Ostensibly one has to replace the entropy of random variables $X$ and $Y$ by the empirical entropy
according to Definition 2.1 deduced from strings $x$ and $y$. To obtain the required result (III.2) one has
to use families $\mathcal{X}, \mathcal{Y}, \mathcal{Z}$ of computable random variables such that $K(x) = H(x|\mathcal{X})$, $K(y) = H(y|\mathcal{Y})$,
and $K(x, y) = H(x, y|\mathcal{Z})$. In our framework this is possible only if $\mathcal{X}, \mathcal{Y}$ are appropriate families of
computable random variables, and $\mathcal{Z}$ is an appropriate family of computable joint random variables.
Outside our framework the widest notion of empirical entropy is (II.1) and there it is not possible at all.
To obtain computable approximations using a real-world compressor $Z$ for $x$ and $y$ as in (III.4) we
can take the empirical entropy based on compressor $Z$ as in (III.4) and (III.5).



Appendix 

A. Computability 

In 1936 A.M. Turing [24] defined the hypothetical 'Turing machine' whose computations are intended
to give an operational and formal definition of the intuitive notion of computability in the discrete domain.
These Turing machines compute integer functions, the computable functions. By using pairs of integers
for the arguments and values we can extend computable functions to functions with rational arguments
and/or values. The notion of computability can be further extended, see for example [17]: A function
$f$ with rational arguments and real values is upper semicomputable if there is a computable function
$\phi(x, k)$ with $x$ a rational number and $k$ a nonnegative integer such that $\phi(x, k + 1) \leq \phi(x, k)$ for
every $k$ and $\lim_{k \to \infty} \phi(x, k) = f(x)$. This means that $f$ can be computably approximated from above. A
function $f$ is lower semicomputable if $-f$ is upper semicomputable. A function is called
semicomputable if it is either upper semicomputable or lower semicomputable or both. If a function $f$
is both upper semicomputable and lower semicomputable, then $f$ is computable. A countable set $S$ is
computably (or recursively) enumerable if there is a Turing machine $T$ that outputs all and only the
elements of $S$ in some order and does not halt. A countable set $S$ is decidable (or recursive) if there is
a Turing machine $T$ that decides for every candidate $a$ whether $a \in S$ and halts.

Example A.1: An example of a computable function is $f(n)$ defined as the $n$th prime number; an
example of a function that is upper semicomputable but not computable is the Kolmogorov complexity
function $K$ in Appendix B. An example of a recursive set is the set of prime numbers; an example of a
recursively enumerable set that is not recursive is $\{x \in \mathcal{N} : K(x) < |x|\}$.

B. Kolmogorov Complexity 

Informally, the Kolmogorov complexity of a string is the length of the shortest string from which
the original string can be losslessly reconstructed by an effective general-purpose computer such as a
particular universal Turing machine $U$, see [11] or the text [17]. Hence it constitutes a lower bound on how
far a lossless compression program can compress. In this paper we require that the set of programs of
$U$ is prefix free (no program is a proper prefix of another program), that is, we deal with the prefix
Kolmogorov complexity. (But for the results in this paper it does not matter whether we use the plain
Kolmogorov complexity or the prefix Kolmogorov complexity.) We call $U$ the reference universal Turing
machine. Formally, the conditional prefix Kolmogorov complexity $K(x|y)$ is the length of the shortest
input $z$ such that the reference universal Turing machine $U$ on input $z$ with auxiliary information $y$ outputs
$x$. The unconditional prefix Kolmogorov complexity $K(x)$ is defined by $K(x|\epsilon)$. The functions $K(\cdot)$ and
$K(\cdot|\cdot)$, though defined in terms of a particular machine model, are machine-independent up to an additive
constant and acquire an asymptotically universal and absolute character through Church's thesis, see for
example [17], and from the ability of universal machines to simulate one another and execute any effective
process. The Kolmogorov complexity of an individual finite object was introduced by Kolmogorov [11]
as an absolute and objective quantification of the amount of information in it. The information theory
of Shannon [20], on the other hand, deals with average information to communicate objects produced
by a random source. Since the former theory is much more precise, it is surprising that analogs of
theorems in information theory hold for Kolmogorov complexity, be it in somewhat weaker form. For
example, let $X$ and $Y$ be random variables with a joint distribution. Then, $H(X, Y) \leq H(X) + H(Y)$,
where $H(X)$ is the entropy of the marginal distribution of $X$. Similarly, let $K(x, y)$ denote $K(\langle x, y \rangle)$
where $\langle \cdot, \cdot \rangle$ is a standard pairing function and $x, y$ are binary strings. An example is $\langle x, y \rangle$ defined by
$y + (x + y + 1)(x + y)/2$ where $x$ and $y$ are viewed as natural numbers as in (I.1). Then we have
$K(x, y) \leq K(x) + K(y) + O(1)$. Indeed, there is a Turing machine $T_i$ that provided with $\langle p, q \rangle$ as an
input computes $\langle U(p), U(q) \rangle$ (where $U$ is the reference Turing machine). By construction of $T_i$, we have
$K_i(x, y) \leq K(x) + K(y)$, hence $K(x, y) \leq K(x) + K(y) + O(1)$.
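
The example pairing function is easy to check mechanically. The sketch below implements $\langle x, y \rangle = y + (x + y + 1)(x + y)/2$ on natural numbers together with an inverse; the inverse and the names `pair` and `unpair` are additions for illustration and do not appear in the text.

```python
from math import isqrt


def pair(x: int, y: int) -> int:
    """Pairing function y + (x + y + 1)(x + y) / 2 on natural numbers."""
    return y + (x + y + 1) * (x + y) // 2


def unpair(z: int) -> tuple:
    """Inverse of pair: recover (x, y) from z.

    s = x + y is the largest integer with s(s + 1)/2 <= z, found with an
    exact integer square root to avoid floating point error.
    """
    s = (isqrt(8 * z + 1) - 1) // 2
    y = z - s * (s + 1) // 2
    return s - y, y


if __name__ == "__main__":
    for x in range(3):
        for y in range(3):
            print((x, y), "->", pair(x, y))
```

Because the map is a computable bijection between pairs of natural numbers and natural numbers, describing the pair costs only a constant number of bits more than describing its code, which is what the definition of $K(x, y)$ relies on.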

Another interesting similarity is the following: $I(X; Y) = H(Y) - H(Y | X)$ is the (probabilistic)
information in random variable $X$ about random variable $Y$. Here $H(Y | X)$ is the conditional entropy
of $Y$ given $X$. Since $I(X; Y) = I(Y; X)$ we call this symmetric quantity the mutual (probabilistic)
information.

Definition A.2: The (algorithmic) information in $x$ about $y$ is $I(x : y) = K(y) - K(y | x)$, where $x, y$
are finite objects like finite strings or finite sets of finite strings.

It is remarkable that also the algorithmic information in one finite object about another one is symmetric:
$I(x : y) = I(y : x)$ up to an additive term logarithmic in $K(x) + K(y)$. This follows immediately from
the symmetry of information property due to A.N. Kolmogorov and L.A. Levin (they proved it for plain
Kolmogorov complexity but in this form it holds equally for prefix Kolmogorov complexity):

$$\begin{aligned} K(x, y) &= K(x) + K(y | x) + O(\log(K(x) + K(y))) \qquad \text{(A.1)} \\ &= K(y) + K(x | y) + O(\log(K(x) + K(y))). \end{aligned}$$